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ABSTRACT 

We  have  extended  the  Kaiser-Teager  algorithm  for  sep¬ 
arating  the  contributions  of  the  amplitude  modulation  and 
frequency  modulation  of  a  single  sinusoid  to  a  signal  con¬ 
sisting  of  multiple  components.  To  achieve  this  separation, 
we  use  an  instantaneous  non-linear  operator,  which  turns 
out  to  be  the  determinant  of  a  Toeplitz  matrix  formed  with 
the  signal  samples.  Because  of  its  instantaneously  adaptive 
nature,  we  can  use  this  algorithm  to  track  parameter  varia¬ 
tions  in  the  signal  components,  provided  these  variations  are 
not  too  rapid.  We  demonstrate  this  using  a  synthetic  signal 
containing  two  AM-FM  signal  components  and  a  speech  sig¬ 
nal.  We  also  point  out  the  method’s  relationship  to  Prony’s 
method. 


1.  INTRODUCTION 

In  many  applications,  such  as  speech  processing  [1,2],  radar 
signal  processing  [3],  one  can  model  the  observed  signal  as 
a  linear  combination  of  a  small  number  of  sinusoidal  sig¬ 
nals  that  arc  slowly  time-varying  in  both  amplitude  and 
frequency.  In  such  cases,  a  signal  consisting  of  m  compo¬ 
nents  may  be  parametrized  as 

m 

*{‘)  =  X^*(0  «*(**(<)).  (i) 

where  =<*'*  +  d’k(t).  u>*  is  the  nominal  carrier  fre¬ 

quency  of  the  t,h  component,  while  ^t(t)  and  At(t)  are  the 
time-varying  frequency  and  amplitude,  respectively.  Given 
the  signal  c(t),  one  wishes  to  determine  and  track  the  am¬ 
plitude  envelopes  and  the  frequency  trajectories  of  the  in¬ 
dividual  signal  components.  Traditional  short-time  Fourier 
transform  methods  are  not  always  successful  in  processing 
such  signals  due  to  their  limited  resolution.  The  so-called 
time-frequency  representations,  such  as  the  Wigner  distri¬ 
bution,  have  their  own  problems  with  multicomponent  sig¬ 
nal.  [4]. 

Recently,  Kaiser  [5,  €]  introduced  a  novel  approach  to 
track  the  frequency  and  amplitude  variations  of  a  single 
component  signal  by  applying  what  is  called  an  energy  op¬ 
erator.  Originally  devised  by  Teager,  the  discrete-time  ver- 

*  THIS  WORK  WAS  SUPPORTED  BY  AN  AIR  FORCE 
CONTRACT  AFOSR  4»tS0-ft-J-037t 


sion  of  the  energy  operator  ¥(•)  is  an  instantaneous  non¬ 
linear  function  of  the  samples  of  a  signal: 

=  *»  —  r„- itn+i .  (2) 

If  xn  represents  a  rinewave,  i.e.,  x„  =  AiCos(wm  +^j), 
whose  radian  frequency  u i  is  constant,  then  the  ‘energy* 
¥(*»)  a  A*  sin*  b/)  [5],  Observe  that  this  quantity  is  time- 
invariant.  Using  this  functional  dependence  of  energy  on 
the  frequency  and  amplitude  of  a  sinusoidal  signal,  Kaiser 
and  his  collaborators  [7]  devised  methods  to  separate  the 
contribution  of  the  amplitude  and  frequency  of  a  signal. 
They  then  applied  this  method  to  approximately  deter¬ 
mining  the  amplitude  and  frequency  variations  of  an  am¬ 
plitude/frequency  modulated  (AM-FM)  signal.  Although 
their  method  is  computationally  simple  and  instantaneously 
adaptive  in  nature,  it  cannot  be  directly  applied  to  track¬ 
ing  amplitude  and  frequency  variations  of  signals  composed 
of  multiple  components  as  in  (1),  For  such  multicomponent 
signals,  Kaiser  [5]  advocates  first  separating  the  signals  into 
frequency  components  by  filtering. 

In  this  paper  we  extend  the  method  in  [7]  to  simulta¬ 
neously  tracking  the  amplitude  and  frequency  variations  of 
multicomponent  signals.  This  is  accomplished  by  using  cer¬ 
tain  instantaneous  non-linear  operators  on  the  signal  sam¬ 
ples.  These  can  be  motivated  by  viewing  the  Teager  energy 
operator  'i(tn)  as  the  determinant  of  a  matrix: 


As  mentioned  before,  this  determinant  is  time-invariant  for 
a  single  sinusoidal  signal  with  constant  frequency.  We  have 
shown  [8]  that  this  time-invariance  property  of  the  deter¬ 
minant  can  be  extended  to  an  appropriately  constructed 
higher-order  m  x  m  matrix,  whose  elements  are  the  sam¬ 
ples  of  a  sum  of  m  sinusoids.  A  similar  result  has  been  ob¬ 
tained  for  continuous-time  signals  as  well  [8],  In  section  2 
we  have  outlined  this  result.  In  section  3  we  show  how  these 
results  can  be  related  to  the  energy  separation  algorithms 
derived  by  Maragos,  Quatieri,  and  Kaiser  (7).  Further,  in 
section  4,  using  the  functional  dependence  of  these  deter¬ 
minants  on  the  frequencies  and  amplitudes  of  the  sinewave 
components,  we  describe  the  two-component  AM-FM  sep¬ 
aration  algorithm.  We  obtain  explicit  expressions  for  the 
instantaneous  frequencies  of  two  (real-valued)  component 
signals.  This  can  be  extended  to  up  to  four  components 


*(*„)  =  Det 
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However,  beyond  font  component*  one  has  to  root 
j*pefynomi*l  to  obtain  the  frequencies.  Thi*  is  reminiscent 
f  well-known  Prorfy’s  method  [9]. 

|.  TIME  INVARIANCE  Of  THE 
DETERMINANT  or  A  toeplitz  matrix 

Itct  *•»  be  the  (ample*  of  &  linear  combination  of  m 
fejmple*' valued,  dttcrete-time  (iauaoidal  signal*  with  arbi- 
i^y  phases  and  non-sero  amplitudes,  ».e., 


5> 


.!(•*"+  Sa) 


M 


ksl 


The  distinct  radian  frequencies  w*  €  (0,  3t)  have  no  par¬ 
ticular  relationship  between  them.  Consider  the  following 
mxm  Toeplitx  matrix  Xm(n)  formed  with  the  elements  of 
the  sequence  *»: 


Xm(")  — 


c» 

*»-l 


*«+i 


£n+m— 1 
Xn+m- 2 


L  I  n— 1  Xt»— m+2 


•  (5) 


Using  the  definition  of  zn  in  (4),  we  can  decompose  Xm(n) 
into  a  product  of  three  matrices  to  facilitate  the  compute* 
lion  of  its  determinant,  Ax.  where  the  dependence  on  m 
and  n  is  not  explicitly  shown  for  notational  simplicity.  This 
decomposition  may  be  written  as 


X,(.)  «  VA(n)V*, 


w 


The  superscript  ‘H'  denotes  Hermitian  transpose.  The  di¬ 
agonal  matrix  A(n)  has  Xu  —  k  =  1, 3, . . .  ,m 

as  its  diagonal  entries.  V  is  a  Vandermonde  matrix  [10] 
with  entries  v*i  =s  1,1  «  1,2,.,  ,m.  Note  that 

the  diagonal  matrix  in  the  middle  is  the  one  that  shows 
dependence  on  the  time  index.  The  determinant  of  V  is 
Hr,.,  4<J(e***  —  e**'1)  [10]  and  the  determinant  of  A(n)  is 
nr=j  Aae^w‘"+n).  Bence,  after  some  simplification, 


Ax  -  ft  A*  «*-*♦*>  ft  4  sin1  (^i)  .  (7) 

kal  V«5* 

Since  the  complex- valued  factor  in  the  above  formula  has 
unit  magnitude,  we  observe  that  the  magnitude  of  Ax  is 
invariant  with  the  time  index. 

If  the  aignal  is  composed  of  m/3  real-valued  sinusoids, 

*»  “  ^7-1  -4*  «*(<•>»*  .  then  Uk+n/t  *  — w» 

und  da^m/a  *  -da  for  k  =  1,3,  ...,m/3.  This  leads  to 
further  simplification  of  (7)  for  Ax: 

m/J 

A\  iin3  w* 

k»\ 

ff  (‘*,(afa)),(**,(afa))'w 

h,|«l 

»<» 
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Note  that  this  determinant  is  always  positive,  and  of  course, 
time-invariant.  The  essence  of  the  above  results  can  be  ex¬ 
pressed  in  words  as  follows:  For  a  sequence  *ft,  composed 
of  a  sum  of  m  complex-valued  einbwtves  with  arbitrary  fre¬ 
quencies,  phases,  and  amplitudes,  there  exists  an  instanta¬ 
neous  non-linear  function  involving  (2m  —  1)  samples  that 
takes  on  a  constant  magnitude  anywhere  on  the  time-index 
ass.  A  a»«»ila»  statement  is  true  for  real-valued  ainewavea. 
This  appears  to  be  a-  useful  fact  as  demonstrated  below. 
Similar  results  can  also  be  obtained  for  continuoua-time  sig¬ 
nals  [8]. 

Now  assume  that  the  sequence  x„  in  (4)  is  passed 
through  a  linear  time-invariant  filter  with  impulse  response 
A„  i/(eJ").  Then  the  output  sequence  y„  may  be  writ¬ 
ten  as 

m 

Vn  =  '£)A'kt^n^)  ,  (9) 

kml 

where  A't  m.  As  H («■'“*)•  This  result  follows  from  the  fact 
that  complex  exponentials  are  eigenfunctions  for  a  linear 
time-invariant  Alter.  If  we  construct  an  m  x  m  Toeplitz 
matrix  Ym(n)  (aimilax  to  Xm(n)  in  (5))  with  the  elements 
of  the  sequence  y„,  since  the  expression  for  y„  is  akin  to 
that  of  x.  in  (4),  we  can  write  down  the  determinant  of 
Ym(i»)  using  (7)  as 


Ay  -  ft  Ai  «*—•♦**>  ft  ^  ’  (10) 

»•!  a.iwi 

*<■ 

Therefore,  the  determinants  of  Ym(n)  and  Xm(n)  are  re¬ 
lated  aa  follows: 

m 

Ay^AxU^)  (H) 

Ski 

Next,  we  show  the  relationship  of  the  above  results  to  the 
Discrete  Energy  Separation  Algorithms  (DESA)  of  Marago s 
€t  ol.  [7], 

3.  RELATION  TO  THE  DESA  ALGORITHMS 

In  [7]  Marago*  et  ol.  have  proposed  two  algorithms,  vit., 
DESA-3  and  DESA-1,  to  separate  amplitude  modulation 
from  frequency  modulation  using  the  energy  operator.  Us¬ 
ing  the  results  derived  in  the  previous  section  we  first  show 
how  we  can  obtain  their  algorithms.  Extension  to  more 
than  one  component  is  discussed  in  the  next  two  sections. 

Let  sn  m  Ai(n)cos(u>,(n)  +  di)  correspond  to  a  single 
AM-FM  signal.  Let  us  assume  that  A,(n)  and  u»,(n)  are 
only  slowly  varying  and  that  they  can  be  approximated  as 
constants  over  any  consecutive  three-sample  period.  Since 
(.  consists  of  two  complex  exponentials  (m  =  3),  from  (8) 
Ax  33  A,(«)  sin*[wj(n)].  Next,  filter  xn  through  a  filter 
with  impulse  response  A„  —  {^,  0,  -£}  $(«,“'  —  «-JW) 

to  yield  y»-  Using  (11),  Ay  is  found  to  be 

Ay  a  Ax 

.  Aj(»)  sin’M»)]  I|e^'(",-e-^'l“,|* 

-  <4?(n)  ain4[u»,(»))  .  (12) 


04 
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which  immediately  lead*  to  DESA-2.  To  obtva  DESA-1 
tee  observe  that  y„  is  the  output  of  a  filter  whose  transfer 
function  it  /fj(eJ*')  **  1  while  *<*  »»  the  result  of 

filtering  tn  by  /fj(eJ")  =  e1"  —  1,  and  applying  (11)  yield* 
Eq.  (S)  in  [T].  Thu*,  the»e  two  of  the  algorithm*  proposed  in 
[7]  can  be  obtained  a*  special  case*  of  the  above  expression*. 

The  algorithm*  devited  in  (7]  have  the  disadvantage  of 
being  applicable  to  only  single-component  AM-FM  signals. 
Using  (11)  we  next  show  how  to  separate  amplitude  from 
frequency  modulation  in  the  case  of  a  two-component  AM- 
FM  signal. 

4.  A  TWO-COMPONENT  AM-FM  SIGNAL 
SEPARATION  ALGORITHM 

We  now  extend  the  above  algorithm  to  separate  a  (real¬ 
valued)  two-component  AM-FM  signal.  Let  r«  = 
Ai(n)co*(wi(n)  -f  di)  +  Aj(n)cos(u>j(n)  -f  dr)-  For  this 
case,  m  at  4.  Next,  we  filter  *„  by  »  1  +  e-,“  to 

get  y5,,)  and  by  2fj(e*“)  a  1  —  e",w  to  get  yS,**.  A*  before, 
we  assume  that  A,(n)  and  w,(n),  i  a  1,3  are  only  slowly 
varying  so  that  they  can  be  approximated  as  constants  over 
any  period  of  2m  =  8  samples.  Hence,  as  before,  using  (11) 


Ay,  _ 

|l  +  j* 

|l  +  e-'-ri") 

Ax 

Ay,  _ 

|1 

Ax  ~ 

for  l  <  k  <  m.  Or,  using  matrix  notation, 


«r 

oj 


oj 


°t 

•a 


\  ®  m 


m  / 
m  / 


*» 


Om 


\  t„ 


fr-' 


% 

(16) 

where  6a  is  the  (k  4-  1)"  term  in  the  expansion  of 
FIui  0  +  «■>*’'(">).  Since  the  sa’s  are  distinct,  the  Vander¬ 
monde  matrix  on  the  left-hand  side  is  non-singular.  Hence 
(16)  can  be  solved  by  matrix  inversion.  The  frequencies 
Uk(n)  are  easily  seen  to  be  the  angles  of  the  roots  of  the 
polynomial  1  +  i  Once  again,  the  assumption  is 

that  the  parameters  of  the  AM-FM  signal  can  be  approxi¬ 
mated  as  constants  over  any  2m-s*mple  interval. 


«.  SIMULATIONS  AND  DISCUSSION 

As  an  illustrative  example  of  the  above  techniques,  consider 
the  fallowing  two-component  AM-FM  signal  of  the  form 
s(n)  =  Ai(n)si(n)  +  Aj(«)sj(n),  where 


*i(»)  = 


cot 

cos 


[0.25  *n  +  3& 


[  0.35arn  —  1— 


n  =  0,1,..., 200 
n  =  201,  202,...,  399 


*,(n)  = 


cot 

cot 


0.4rn  •+•  ■ 


ns  0,1,...,  199 
n  =  200,  201,...,  399 


Simplifying,  we  obtain 

1  /Ay,  wi(n)-w*(n)  <*>i  (n)  +  wj(n) 

2V  Ax  “  co*  2 - —  4- cos— - —  ,(13) 

1  /A Fa  „  cM  wi(n) -w?(n)  cojWi(n)-hwa(n) 

2  V  Ax  2  2 

From  the  above  two  equation*  we  can  solve  fox  t»i(n)  and 
uj(n),  respectively.  Once  the  frequencies  have  been  com¬ 
puted,  the  amplitudes  can  be  obtained  by  solving  a  set  of 
linear  equations  in  the  least-squares  sense. 

8.  MULTICOMPONENT  AM-FM 
SEPARATION  ALGORITHM 

The  AM-FM  separation  algorithm  developed  in  the  previ¬ 
ous  section  can  be  generalized  to  deal  with  an  m-component 
signal.  To  this  end,  consider  m  distinct  length-two  filters, 
with  transfer  functions  of  the  form  Hk{t3“)  =  1  +  «se,*> 
1  The  a*  arc,  in  general,  complex.  Denote  by 

the  output  of  the  filter  /f»(eJW)  with  x„  as  its  input.  Then, 
applying  (11),  the  determinant*  of  the  m  x  m  Tocplitz  ma¬ 
trices  formed  from  the  samples  of  r„  and  yi**  are  related 

by 

wi 

Ay,  *  Ax  •  IJ  (l  +  «*  «,-<*>)  (15) 


*<*>  -  I)  „_M . 3M 

A»(n)  =  1  -  0  25co*  (|^  +  ep) 

The  overall  signal  *(n)  is  shown  in  Fig.  1(a).  The  estimated 
frequency  track*  are  shown  in  Fig.  1(b).  The  amplitudes, 
obtained  by  solving  a  set  of  linear  equations  in  the  least- 
squares  sense,  are  shown  in  Fig.  1(c).  In  solving  for  the 
frequencies,  we  found  that  (13)  and  (14)  led  to  numerical 
difficulties  even  in  the  presence  of  small  amounts  of  noise. 
This  is  because  the  argument  of  the  inverse  cosine  function 
did  not  always  have  magnitude  less  than  unity.  On  the  other 
hand,  frequency  estimates  obtained  by  rooting  a  polynomial 
Were  found  to  be  more  robust.  Hence  this  approach  was 
used. 

We  next  applied  this  algorithm  to  speech  data  cor¬ 
responding  to  the  phoneme  /oo/  (16  kHz  sampling  fre¬ 
quency).  The  data  contained  four  formant  frequencies 
around  500  Hz,  1200  Hz,  2300  Hs,  and  3100  Hz.,  respec¬ 
tively.  Subsequent  low-pass  filtering  eliminated  the  third 
and  the  fourth  formant  frequencies.  Fig.  2(a)  shows  the  fil¬ 
tered  speech  signal.  We  used  a  model  order  m  =  8  (four  real 
sinusoids)  on  this  data.  Fig.  2(b)  show*  the  first  two  for¬ 
mant  frequency  tracks,  after  smoothing  by  an  eleven-point 
median  filter.  In  Fig.  2(c)  the  corresponding  least-squares 
estimates  of  the  amplitude  envelopes  are  shown,  which  used 
the  median-filtered  frequencies.  Even  though  the  filtered 
signal  has  only  two  dominant  components,  a  model  order  of 

BW«  thank  Dr.  Shubha  Kadambe  of  A.  I.  duPont  Institute, 
DE,  for  supplying  us  the  speech  data  used  in  our  simulations. 
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4  (two  real  sinusoids)  yielded  poor  estimates  of  the  for¬ 
mat  frequencies.  Increasing  the  order  to  m  ®  6  improved 
estimates  noticeably;  the  estimates  were,  however,  still 
soby-  A  model  order  m  ™  8  was  the  smallest  order  required 
yield  reasonable  results  for  this  example. 

7.  CONCLUSIONS 

We  have  extended  the  Kaiser-Teager  method  to  multicom¬ 
ponent  signals.  Even  though  this  approach  appears  to  be 
different  from  traditional  frequency  estimation  algorithms, 
it  bears  a  close  resemblance  to  the  Prony's  method  in  that 
it  requires  polynomial  rooting  to  estimate  the  component 
frequencies,  particularly  if  m  is  larger  thin  four  (for  com¬ 
plex  signals)  or  eight  (for  real  signals).  It  seems  hard  to 
acape  from  the  dutches  of  Prony’s  method.  There  is  scope 
for  improving  the  method’s  performance  in  the  presence  of 
noise  by  choosing  the  Hk(c*u)  appropriately. 
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Fig.  2  (a)  Signal  from  speech  vowel  /oo/.  filtered  to  retain  only  the  first  two  formant  frequencies,  (b)  Estimated  frequency 
tracks  of  the  first  and  second  formant  frequencies,  smoothed  by  an  11-point  median  filter.  Assumed  model  order:  m  =  8. 
(c)  Estimated  amplitude  envelopes  obtained  via  least- squares  solution  that  utilized  the  median-filtered  frequencies. 
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ABSTRACT 

We  propose  an  improved  method  to  model  voiced 
speech  signals.  First,  we  describe  a  method  to  accurately 
model  the  signals  using  a  linear  combination  of  harmoni¬ 
cally  related  sinewaves.  The  method  fits  a  linear  combi¬ 
nation  of  sines  and  cosines  whose  frequencies  are  integer 
multiples  of  the  unknown  fundamental  (pitch)  frequency 
to  the  speech  data  in  the  least-square  sense.  The  ampli¬ 
tudes  of  the  sinewaves  and  the  fundamental  frequency  are 
the  unknowns  and  are  determined  simultaneously  using  the 
least-squares  fit.  Using  our  method,  we  show  how  one  can 
obtain  smoothly  varying  frequency  and  amplitude  tracks 
for  all  the  harmonics  and  thus  model  the  speech  signal  par¬ 
simoniously.  After  obtaining  the  harmonic  decomposition, 
we  regard  the  time-varying  amplitudes  of  the  cosinusoidal 
and  sinusoidal  harmonic  components  as  the  real  and  imag¬ 
inary  parts  of  the  complex-valued  frequency  responses  of 
the  slowly  time- varying  filter  representing  the  vocal  tract 
and  glottal  excitation  pulse  generator,  in  cascade.  We  then 
fit  a  sequence  of  all-pole/pole-zero  models  to  the  complex 
frequency  response  values. 


1.  INTRODUCTION 

Voiced-segments  constitute  a  significant  portion  of  speech 
signals.  In  many  applications,  it  is  important  to  extract  fea¬ 
tures  such  as  the  pitch  frequency  and  the  vocal  tract  trans¬ 
fer  function  from  these  segments  accurately,  even  when  the 
speech  signal  is  corrupted  by  noise.  Usually,  short-time 
Fourier  transform  (STFT),  linear  prediction  (LP)  or  cep- 
stral  methods  are  used  to  extract  these  features. 

Voiced-speech  signals  are  often  modeled  as  the  output  of  a 
slowly  time- varying  linear  filter  representing  the  vocal  tract, 
excited  by  a  quasi-periodic  glottal  pulse  train.  If  the  glottal 
pulse  train  were  indeed  exactly  periodic,  it  can  be  repre¬ 
sented  by  a  Fourier  series  with  the  fundamental  frequency 
corresponding  to  the  pitch  frequency,  which  is  given  by  the 
reciprocal  of  the  period  of  the  pulse  train.  Since  the  pulse 
train  is  only  quasi-periodic,  the  voiced  speech  waveform 
may  be  modeled  by  a  sum  of  harmonically  related  sinewaves 
with  slowly  varying  fundamental  frequency,  with  arbitrary 
amplitudes  and  phases.  Many  authors,  perhaps  starting 
with  Flanagan,  have  observed  this  feature  and  taken  ad¬ 
vantage  of  it.  Recently,  McAulay  and  Quatieri  [l,  2,  3,  4} 
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(see  also  the  references  therein)  have  carried  out  extensive 
work  in  modeling  both  voiced  and  unvoiced  speech  by  us. 
ing  a  linear  combination  of  sinusoidal  signals.  They  have 
applied  it  to  speech  coding,  co-charnel  interference  suppres- 
sion,  and  time  scaling  of  speech.  The  algorithms  proposed 
in  the  above  references  rely  primarily  on  the  STFT  (or  some 
modification  of  it)  to  obtain  the  sinewave  decomposition. 

In  this  paper,  our  primary  contributions  are  two-fold: 

•  We  describe  a  method  to  accurately  estimate  the  fun¬ 
damental/pitch  frequency  and  the  amplitudes  of  the 
harmonically  related  sinewaves  simultaneously,  using  a 
direct  least-squares  fit  to  the  speech  data.  Such  meth¬ 
ods  are  well  known  to  model-based  spectral  analysis 
practitioners  but  appears  not  to  have  been  used  in 
speech  analysis.  Wt  apply  this  method  to  a  speech 
segment  over  short,  possibly  overlapping  windows.  Un¬ 
like  in  [l),  we  do  not  assume  that  the  analysis  window 
be  an  integer  multiple  of  the  pitch  period  or  use  the 
STFT  pe  Jvt  to  determine  the  parameters.  Also,  we  do 
not  employ  the  pitch  synchronous  analysis  advocated 
in  [5j.  Using  our  method  we  show  how  one  can  ob¬ 
tain  smoothly  varying  frequency  and  amplitude  tracks 
for  all  the  harmonics  and  thus  model  the  speech  signal 
parsimoniously.  This  method  is  in  fact  the  maximum- 
likelihood  method,  if  the  background  noise  is  white  and 
the  assumed  signal  model  is  valid.  Therefore,  if  the 
speech  signal  is  corrupted  by  noise,  it  may  be  advanta¬ 
geous  to  estimate  the  harmonic  components  first  using 
our  method  and  then  use  them  as  ‘clcaned-up’  data  for 
further  modeling  of  the  vocal  tract  etc. 

•  After  obtaining  the  harmonic  decomposition,  we  regard 
the  time-varying  amplitudes  of  the  cosinusoidal  and  si¬ 
nusoidal  harmonic  components  as  the  real  and  imagi¬ 
nary  parts  of  the  complex-valued  frequency  responses 
of  the  time-varying  filter,  representing  the  vocal  tract 
and  glottal  excitation  pulse  generator,  in  cascade.  We 
assume  that  this  cascaded  filter  is  slowly  time-varying. 
We  then  fit  a  sequence  of  transfer  function  models  to 
the  complex  frequency  response  values. 

3.  ACCURATELY  ESTIMATING  THE 
HARMONIC  COMPONENTS 

Let  us  assume  that  a  block  of  N  samples  of  xn,  n 
0, . . . ,  N—  1,  is  to  be  modeled  by  a  signal  s«  consisting  of  M 
harmonically  related  sinewaves  with  unknown  amplitudes 
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nd  fundamental  frequency  up.  Af  is  assumed  known. 

M 

s„  »  ^A*cos(u>0tn)  +  5ksin(woin)  .  (1) 

kml 

Ve  wish  to  minimize  the  sum  of  squared  errors  by  choosing 
unknown  amplitudes  and  the  frequency  wo. 

n-\ 

^  =  (2) 

n*0 

n  matrix-vector  notation,  let  x  =  (*o,  *i,  -  •  ■  ,*>r-t)T  and 
5  »  (so.  St, . . . ,  s//-t)T  denote  the  data  and  signal  model 
vectors,  respectively.  Using  the  model  given  in  (1)  we  can 
write  the  signal  vector  s  as 

s  =  Wa,  (3) 

where  a  is  the  2M  x  1  vector  of  unknown  amplitudes  a  = 

(.4),  .4;,  .  • ,  Ajw,  f?i,  Bj, . . . ,  B*r)T  and  W  is  an  N  x  2 M 
matrix,  whose  (fc,l)-th  element  is  given  by 

cos(i/w  o)  1  =  1,2,...,  Af 

sin(l(l- A/)wo)  /  =  M  +  1.J/  +  2,..., 2M 

for  i  =  0, 1 . N  —  Using  this  notation  we  can  rewrite 

the  error  £  as 

£  =  l|x-Wa||*  (4) 

Since  both  W  and  a  are  unknown,  this  problem  is  a  bilinear 
least-squares  problem.  Such  problems  have  been  dealt  with 
in  numetical  analysis  and  spectral  analysis  literature  for  the 
past  20  years  [6]  (equation  16.152).  The  standard  trick  is  to 
assume  that  wo  is  known  and  then  solve  the  least-squares 
problem  for  the  best  amplitudes.  For  a  given  w0  the  best 
amplitude  a  is  given  by 

a  =  (WrW)-lWTx  .  (5) 

Substituting  this  value  of  a  back  into  the  error  expression 

in  (4)  gives 

E  =»  xT(I  -  W(WTW)-lWT)x,  (6) 

where  we  have  used  the  fact  that  the  projection  matrix 
(I- W(WTW)~1  Wr)  is  idempotent.  Note  that  E  now  de¬ 
pends  explicitly  on  the  unknown  wo  only.  This  error  can  be 
minimized  by  a  coarse  search  over  wo  followed  by  a  gradient 
descent  procedure.  We  have  derived  explicit  expressions  for 
the  gradient  and  the  Hessian,  which  are  given  in  (7).  Once 
the  best  wq  that  minimizes  E  is  found,  the  corresponding 
amplitudes  can  be  obtained  from  (5). 

We  mention  two  closely  related  problems.  In  the  case  of 
co-channel  interference  suppression  [2],  that  is  separating 
speech  signals  from  two  different  speakers  that  have  been 
added  together,  the  same  modeling  procedure  as  above  can 
be  used.  Except,  in  this  case,  e„  in  (1)  will  be  modeled 
as  a  linear  combination  of  two  sets  of  harmonically  related 
sinewaves  with  two  different  fundamental  frequencies.  Now 
the  error  E  in  (6)  will  have  to  be  minimized  over  two  inde¬ 
pendent  fundamental  frequencies.  In  other  situations,  such 


as  modeling  unvoiced  segments,  the  modeling  procedure  can 
be  modified,  to  least-square  fit  a  linear  combination  of  si¬ 
nusoids  which  are  not  necessarily  harmonically  related,  to 
the  data.  In  this  case  it  may  be  necessary  to  filler  the  data 
into  two  or  more  frequency  bands  to  reduce  the  numbex  of 
sinusoids  needed  for  the  fit  in  each  band. 

s.  analysis  of  voiced-segments 

USING  PROPOSED  METHOD 

Fig.  1  shows  3S00  samples  of  the  phoneme  /oo/.  The  data 
was  sampled  at  16  kHz.  Before  applying  the  above  algo¬ 
rithm  to  the  speech  data  we  low-pass  filtered  the  data  to 
about  1000  Hz  and  down-sampled  by  4,  for  the  following 
reasons: 

•  The  low-pass  region  from  0-1  kHz  contains  the  major 
portion  of  the  signal  energy, 

•  In  this  region  the  number  of  harmonic  components  with 
significant  energy  is  likely  to  be  small  i.e.,  of  the  order 
of  10  or  less  and  down-sampling  reduces  the  required 
computation. 

•  Often  the  pitch  frequency  u>0  varies  slowly  with  time. 
This  causes  the  harmonic  components  (some  integer 
multiple  of  wo)  in  the  high  frequency  range  (say,  near 
3000  Hz)  to  sweep  rapidly  in  frequency.  We  wish  to 
exclude  such  components  in  our  modeling,  because  tne 
model  in  (l)  is  less  valid  for  such  components- 

Fig.  2  shows  the  magnitude  of  the  Fourier  transform  of  the 
entire  signal  prior  to  filtering  and  down-sampling. 

Next,  we  applied  the  algorithm  described  in  section  2  to 
estimate  wo  on  overlapping  blocks  of  data.  Fig.  3(a)  shows 
the  error  £  as  a  function  of  the  possible  candidate  wo ’s  for 
the  initial  part  of  speech  data.  To  find  the  minimum  we  first 
performed  a  coarse  search  to  get  a  good  initial  guess  and 
then  used  a  gradient  descent  procedure  [7]  to  find  the  global 
minimum  of  E  which  gave  the  best  u>a  estimate.  Fig.  3(a) 
also  shows  the  effect  of  block  size  on  £  as  it  is  changed  from 
one  pitch  period  (32  samples)  to  about  three  pitch  periods 
(96  samples).  Note  that  the  valley  with  the  global  minimum 
gets  deeper  a*  the  size  of  the  block  increases.  However,  as 
the  block  size  is  increased,  the  optimal  wo  also  increases, 
because,  in  this  example,  thfc  pitch  frequency  is  slightly  in¬ 
creasing  with  time.  Fig.  3(b)  shows  the  value  of  £  as  a 
function  of  the  number  of  assumed  harmonic  components 
M  (M  varying  from  2  to  5)  for  a  fixed  block  size  of  64 
samples.  Observe  that  the  location  of  the  minimum  does 
not  change  much  when  M  is  chosen  greater  than  2.  This 
shows  that  the  precise  value  of  M  may  not  be  that  critical 
while  estimating  u>o.  Also  observe  that  the  DFT  magnitude 
of  the  data  clearly  shows  four  or  five  distinct  peaks  in  the 
frequency  region  from  0  to  1000  Hz. 

The  above  method  for  estimating  uo  is  applied  to  con¬ 
tiguous  overlapping  blocks  of  data.  The  block  size  was  64 
samples.  The  overlap  was  60  samples.  Fig.  4  shows  the 
pitch  frequency  track  thus  obtained  and  its  multiples,  as  a 
function  of  time  (the  dotted  and  solid  curves;  dotted  curves 
have  been  used  for  some  harmonics  for  ease  of  visualization). 
Next,  we  also  estimated  the  frequencies  of  the  underlying 
sinusoidal  components  without  assuming  that  the  sinewaves 
are  harmonically  rotated.  This  was  done  by  a  least-square* 
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